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Abstract 

In a recent paper Lima, Panario and Wang have provided a new method to multiply poly- 
nomials in Chebyshev basis which aims at reducing the total number of multiplication when 
polynomials have small degree. Their idea is to use Karatsuba's multiplication scheme to im- 
prove upon the naive method but without being able to get rid of its quadratic complexity. In 
this paper, we extend their result by providing a reduction scheme which allows to multiply 
polynomial in Chebyshev basis by using algorithms from the monomial basis case and there- 
fore get the same asymptotic complexity estimate. Our reduction allows to use any of these 
algorithms without converting polynomials input to monomial basis which therefore provide 
a more direct reduction scheme then the one using conversions. We also demonstrate that 
our reduction is efficient in practice, and even outperform the performance of the best known 
algorithm for Chebyshev basis when polynomials have large degree. Finally, we demonstrate a 
linear time equivalence between the polynomial multiplication problem under monomial basis 
and under Chebyshev basis. 

1 Introduction 

Polynomials are a fundamental tool in mathematics and especially in approximation theory where 
mathematical functions are approximated using truncated series. One can think of the truncated 
Taylor series to approximate a function as a polynomial expressed in monomial basis. In general, 
many other series are preferred to the classical Taylor series in order to have better convergence 
properties. For instance, one would prefer to use the Chebyshev series in order to have a rapid 
decreasing in the expansion coefficients which implies a better accuracy when using truncation 
[18l|5]. One can also use other series such as Legendre or Hermite to achieve similar properties. 
It is therefore important to have efficient algorithms to handle arithmetic on polynomials in such 
basis and especially for the multiplication problem [S] [7] . 

Polynomial arithmetic has been intensively studied in the past decades, in particular following 
the work in 1962 of Karatsuba and Ofmann [16 who have shown that one can multiply polynomials 
in a subquadratic number of operations. Let two polynomials of degree d over a field K be given in 
monomial basis, one can compute their product using Karatsuba's algorithm in 0{n}°^^ ^) operations 
in K. Since this seminal work, many other algorithms have been invented in order to asymptotically 
reduce the cost of the multiplication. In particular, one can go down to 0(n'°Sr+i(2i-+i)') operations 
in IK with the generalized Toom-Cook method [531 [TU] for any integer r > 0. Finally, one can even 
achieve a quasi-linear time complexity using the so-called FFT PT] assuming the field K have some 
nice properties (see |13i |6] for a good introduction) . One of the main concern of this work is that 
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all these algorithms have been designed for polynomials given in monomial basis, and they do not 
directly fit the other basis, such as the Chebyshev one. 

In this work, we extend the result of Lima, Panario and Wang [17] which is to directly use 
Karatsuba's algorithm [16] within the multiplication of polynomials given in Chebyshev basis. In 
[17] the authors partially succeeded in such a task but without being able to reach Karatsuba's 
asymptotic complexity. Our approach here is more general and it endeavors to completely reduce 
the multiplication in Chebyshev basis to the one in monomial basis. Of course, one can already 
achieve such a reduction by using back and forth conversions between the Chebyshev and monomial 
basis using methods presented in [31I3]. However, this reduction scheme is not direct and it implies 
at least three calls to multiplication in monomial basis: two for the back and forth conversions 
and one for the multiplication of the polynomials. Note that it is not even clear from [31 13] that 
basis conversions are equivalent to only one multiplication in monomial basis. In this work, we 
present a new reduction scheme which does not rely on basis conversion and therefore reduces the 
number of multiplications in monomial basis to only two. We also demonstrate that degenerating 
this reduction for the case of DFT-based multiplication algorithm reduces the number of operations. 
Considering practical efficiency, we will see that our degenerated reduction scheme will definitively 
compete with implementations of the most efficient algorithms available in the literature. 

Organization of the paper. Section [2] recalls some complexity results on polynomial multiplica- 
tion in monomial basis and provides a detailed study on arithmetic operation count in the case of 
polynomials in M.[x]. In Section [3] we give a short review on the available methods in the literature 
to multiply polynomials given in Chebyshev basis. Then, in Section S] we propose our new method 
to perform such multiplication by re-using multiplication in monomial basis. We analyze the com- 
plexity of this reduction and compare it to other existing methods. We perform some practical 
experimentations of such a reduction scheme in Section [S] and then compare its efficiency and give 
a small insight on its numerical reliability. Finally, we exhibit in Section [6] the linear equivalence 
between the polynomial multiplication problem in Chebyshev basis and in monomial basis with 
only a constant factor of two. 

2 Classical Polynomial Multiplication 

It is well-known that polynomial multiplication of two polynomials of K[a;] with degree d — n ~ 1 
can be achieved with less than O(n^) operations in K, for any field K (see [T3JIS]), if polynomials are 
given in monomial basis. Table [T] exhibits the arithmetic complexity of two well known algorithms 
in the case of polynomials of M.[x]. One is due to Karatsuba and Ofman [16] and has an asymptotic 
complexity of 0(n'°S2 3) operations in K; the other one is based on DFT computation using complex 
FFT and it has an asymptotic complexity of O(nlogn) operations in K, see [13', algorithm 8.16] 
and [UllH] for further details. One can see [TSKH] for more details on complex FFT. We also give 
in Table [1] the exact number of operations in K for the schoolbook method. From now on, we will 
use logn notation to refer to log2 n. 

To perform fast polynomial multiplication using DFT-based method on real inputs, one need to 
compute 3 DFT with 2n points, n pointwise multiplications with complex numbers and 2n multipli- 
cations with the real constant ^ . Note that we do not need to perform 2n pointwise multiplications 
since the DFT on real inputs has an hermitian symmetry property. Using Split-Radix FFT of |22| 
with 3/3 strategy for complex multiplication (3 real additions and 3 real multiplications), one can 
calculate the DFT with n points of a real polynomial with ^ log n — ^ + 2 real multiplications 
and ^ log n — ^ + 4 additions. Adding all the involved operations gives the arithmetic operation 
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Table 1: Exact number of operations to multiply two polynomials of M.[x] with degree n — 1 in 
monomial basis s.t. n = 2^" 



Algorithm 


nb. of multiplications 


nb. of additions 


Schoolbook 


2 


(n- 1)2 


Karatsuba 






DFT-based(*) 


3n log 2n — 4ri + 6 


971 log 2n- 12n+ 12 



(*) using real-valued FFT of \22^ with 3/3 strategy for complex multiplication 



count given in Table [T] Note that one can even decrease the number of operations by using the 
modified split-radix FFT of PJ], yielding an overall asymptotic complexity of ^nlog2n instead of 
12nlog2n. 

In the following, we will use the function M(n) to denote the number of operations in M to 
multiply polynomials of degree less than n when using the monomial basis. For instance, M(ri) — 
0(n'°S2 3) with Karatsuba's algorithm. In order to simplify the notations, we assume throughout 
the rest of the paper that polynomials are of degree d — n — 1 with n = 2^ . 

3 Polynomial Multiplication in Chebyshev Basis 

Chebyshev polynomials of the first kind on the interval [—1,1] are defined by 

Tk{x) = cos(fc arcos(x)), fc e N* and a; G [-1, 1]. 

According to this definition, one can remark that these polynomials are orthogonal polynomials. 
The following recurrence relation holds: 

Tk{x) = 2xTfe_i(x) - Tk-2{x) 
Toix) = 1 
Ti{x) = X 

It is obvious from this relation that the i-th Chebyshev polynomial Ti{x) has degree i in x. 
Therefore, it is easy to show that {Ti{x))i>o form a basis of the R-vector space of R[a;]. Hence, 
every polynomial / G M[x] can be expressed as a linear combination of Ti{x). This representation 
is called the Chebyshev expansion. In the rest of this paper we will refer to this representation as 
the Chebyshev basis. 

Arithmetic operations in Chebyshev basis are not as easy as in the classical monomial basis, in 
particular for the multiplication. Indeed, the main difficulty comes from the fact that the product 
of two basis elements spans over two other basis elements. The following relation illustrates this 
property: 

Tj+Ax) + T\i_i\(x) , , 

r,(x) r,(x) = ^±^^^^-^^^, v*,jgN (1) 
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3.1 Quadratic Algorithms 

According to ([T]), one can derive an algorithm to perform the product of two polynomials given 
in Chebyshev basis using a quadratic number of operations in M. This method is often called the 
"direct method" . Let two polynomials a,b € M.[x] of degree d = n — 1 given in Chebyshev basis : 

d ^ d 

a{x) = ^ + X! ^kTk{x) and b{x) = y + X! ^kTk{x). 



k=l k=l 



The 2d degree polynomial c{x) — a{x) b{x) G R[a;] expressed in Chebyshev basis can computed 
using the following formula [T]: 



c{x) = ^ + Y,CkTkix) 



2d 



2 



such that 



d 



ao^o + 2 aibi for fc 0, 

(=1 

k d~k 

2cA;=< '^ak-ibi +^(ai6fc+; + afe+;6;)for fc = 1, 1, (2) 

1=0 1 = 1 

d 

ak-ibi for fc = d, 2d. 

^ l=k — d 

The number of operations in M to compute all the coefficients of c{x) is exactly [Tl 117): 

• + 2n — 1 multiplications, 

(n- l)(3n-2) , . 

• additions. 

2 

Lima et. al recently proposed in |17j a novel approach to compute the coefficient of c{x) which 
reduces the number of multiplications. The total number of operations in R is then: 

• multiplications. 



• Sn^ + 3 _ 677 + 2 additions. 

The approach in [17j is to compute the terms ^ak-ibi using Karatsuba's algorithm |16j on 
polynomial a(x) and b{x) as if they were in monomial basis. 

Of course, this does not give all the terms needed in ©. However, by re- using all partial results 
appearing along the recursive structure of Karatsuba's algorithm, the authors are able to compute 
all the terms aibk+i + cik+ibi with less multiplication than the direct method. Even if the overall 
number of operations in M is higher than the direct method, the balance between multiplication 
and addition is different. The author claims this may have an influence on architectures where 
multiplier's delay is much more expensive than adder's one. 
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3.2 Quasi-linear Algorithms 

One approach to get quasi- linear time complexity is to use the discrete cosine transform (DCT-I). 
The idea is to transform the input polynomials by using forward DCT-I, then perform a pointwise 
multiplication and finally transform the result back using backward DCT-I. An algorithm using 
such a technique has been proposed in pP and achieves a complexity of Oinlogn) operations in M. 
As mentioned in [17] , by using the cost of the fast DCT-I algorithm of [9] one can deduce the exact 
number of operations in R. However, arithmetic operation count in |17) is partially incorrect, the 
value should be corrected to: 

• 3n log 2ri — 2n + 3 multiplications, 

• {9n + 3) log 2n - 12n + 12 additions. 

DCT-I algorithm of [9, costs §logn — n + 1 multiphcations and ^\ogn — 2n + \ogn + 4 
additions when using n sample points. To perform the complete polynomial multiplication, one 
needs to perform 3 DCT-I with 2n points, 2n pointwise multiplications and 2n multiplications by 
the constant Adding all the operations count gives the arithmetic cost given above. 

4 Reduction To Monomial Basis Case 

4.1 Using Basis Conversions 

One can achieve a reduction to the monomial basis case by converting the input polynomials given 
in Chebyshev basis to the monomial basis, then perform the multiplication in the latter basis and 
finally convert the product back. Hence, the complexity directly relies on the ability to perform 
the conversions between the Chebyshev and the monomial basis. In [U, authors have proved 
that conversion between these two basis can be achieved in 0(M(n)) operations. Assuming such 
reductions have a constant factor greater than or equal to one, which is the case to our knowledge, 
the complete multiplication in Chebyshev basis would requires an amount of operation larger than 
3M(7T.). In the next section, we provide a new reduction scheme which decrease the constant factor 
of the reduction to exactly two. 

4.2 Our Direct Approach 

As seen in Section 13.11 Lima et. aFs approach [17 is interesting since it introduces the use of 
monomial basis algorithms (i.e. Karatsuba's one) into Chebyshev basis algorithm. The main idea 
in |17) is to remark that the terms '^ak-ibi in ^ are convolutions of order k. Hence, they are 
directly calculated in the product of the two polynomials 

a{x) = ao + aix + a2x'^ + ... + adx"^ , 

b{x) = bo + bix + a2x'^ + ... + bdx''. (3) 

This product gives the polynomials 

fix) = a{x) b{x) = fa + fix + fiX^ + ... + f2dx"'. 

Each coefficient fk of the polynomial f{x) corresponds to the convolution of order k. Of course, this 
polynomial product can be calculated by any of the existing monomial basis algorithms (e.g. those 



5 



of Section [2]) . Unfortunately, this gives only a partial reduction to monomial basis multiplication. 
We now extend this approach to get a complete reduction. 

Using coefficients f{x) defined above one can simplify ([2]) to 



for fc = 0, 



2cfe = < 



1=1 

d-k 



fk + ^{aibk+i + ak+ibi) for k = 1, ...,d- 1, 



(4) 



1=1 



Jk 



for fc = d, 2d. 



In order to achieve the complete multiplication, we need to compute the three following sum- 
mation terms for k = I . . . d — 1 : 



d-k 



d-k 



(5) 



'^aih , '^aibk+i and ak+ik- 
1=1 1=1 1=1 

Let us define the polynomial f{x) as the reverse polynomial of a{x): 

r(x) = d{x^^)x'^ = ra + rix + r2X^ + . . . + r^x'^. 

This polynomial satisfies — a^-i for i = . . . d. Let the polynomial g{x) be the product of the 
polynomials f(x) and b{x). Thus, we have 

g{x) = f{x) b{x) = .90 + gix + g2X^ + • . • + g2dx'^'^ ■ 
The coefficient of this polynomials satisfies the following relation for k = . . .d : 



d-k 



d-k 



9d+ 



fc = ^ rd-ibk+i and ga-k = ^ ra-k-ik- 



1=0 



According to the definition of r(x) we have: 



d-k 



1=0 



d-k 



gd+k 



= ^ aibk+i and g^^fe = ^ ak+ih 



(6) 



1=0 



1=0 



All the terms defined in ([5|) can be easily deduced from the coefficients gd+k and gd-k of the 
polynomial g{x). This gives the following simplification for (|4]) 



fo + 2{gd - ao&o) 



for fc ^ 0, 



2ck^{fk + gd-k + gd+k - aobk - akbo for fc = 1, ...,d~ 1, 



(7) 



fk 



for fc = d, 2d. 



Applying ^ , one can derive an algorithm which satisfies an algorithmic reduction to polynomial 
multiplication in monomial basis. This algorithm is identified as PM-Chebyshev below. 
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Algorithm 1: PM-Chebyshev 



d 

Input : a{x), b{x) £ M.[x] of degree d = n — 1 s.t. a{x) — — + a^Tkix) and 

fc=i 



h{x)^^ + j^hMx). 




Output: c{x) e M[x] of degree 2d s.t. c{x) — a(x) h(x) = + ^ CkTk{x). 



begin 

let d{x) and b{x) as in ^ 
f{x) := d{x) b{x) 
g{x) := d{x) b{x^^) x'^ 

Co Y ^ 9d - aobo 
for fc — 1 to (i 1 do 

for fc = d to 2(i do 

Cfe -Jfc 

return 
end 



4.3 Complexity Analysis 

Algorithm PM-Chebyshev is exactly an algorithmic translation of ([7]). Its correctness is thus imme- 
diate from (m and dH). 

Its complexity is 0(M(n)) + 0{n) operations in M. It is easy to see that coefficients fk and gj^ 
are computed by two products of polynomials of degree d = n — 1 given in monomial basis. This 
exactly needs 2M{n) operations in R. Note that defining polynomials d{x),b{x) and f{x) does not 
need any operations in M. The complexity of the algorithm is therefore deduced from the number 
of operations in ([7]) and the fact that d ~ n — \. 

The exact number of operations in K of Algorithm PM-Chebyshev is 2M (rt) + 8n — 10. The extra 
linear operations are divided into 4n — 4 multiplications and 4n — 6 additions. It is possible to 
decrease these numbers by setting to zero the constant coefficient of the polynomials a{x) and b{x) 
(i.e. oo = 6o = 0) just before the computation of g{x). Indeed, this removes all the occurrences of 
ao and 5o in ^ which gives the following relation: 




(8) 
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and therefore simplifies (O to 



2ck=< 



fo + 25d 



for k — 0, 



fk + gd-k + gd+k for fc = 1, - 1, 



(9) 



fk 



for k — d, 2d. 



Embedding this tricks into Algorithm PM-Chebyshev leads to an exact complexity of 2M(n) + 
4n — 3 operations in R, where extra linear operations are divided into 2n — 1 multiplications and 
2n — 2 additions. 



Table 2: Arithmetic operation count in Algorithm PM-Chebyshev 



M(n) 


nb. of multiplication 


nb. of addition 


Schoolbook 


2^2 + 2n - 1 


2^2 - 2n 


Karatsuba 


2„i°g3 + 2n-l 


14n'°s3 _ I2n + 2 


DFT-based(*) 


6nlog2n — 6n + 11 


187ilog2n-22n + 22 



(*) using real-valued FFT of 1221 with 3/3 strategy for complex arithmetic 



Table [5] exhibits the exact number of arithmetic operation needed by Algorithm PM-Chebyshev 
depending on the underlying algorithm chosen to perform monomial basis multiplication. We 
separate multiplications from additions in order to offer a fair comparison to [T7] and we use results 
in Tabled] for M{n) costs. 



4.4 Special Case of DFT-based Multiplication 

When using DFT-based multiplication, we can optimize the Algorithm PM-Chebyshev in order to 
further reduce the number of operations. In particular, we can remark that Algorithm PM-Chebyshev 
needs two multiplications in monomial basis using operands d{x),b{x) and a{x),b{x~^)x'^ . There- 
fore, applying the generic scheme of Algorithm PM-Chebyshev, we compute twice the DFT transform 
of a{x) on 2n points. The same remark applies to the DFT transform of b{x) and f{x) — b{x^^)x'^ 
which can be deduced one from the other at a cost of a permutation plus 0{n) operations in R. 
Indeed, we have 

DFT2„(&) = [ b{w'') ]fc=0...2n-l, 
DFT2„(f) = [ b{w-'') LO'"' ]fc=0...2„-l. 

Since w — e~^^ by definition of the DFT, we have u — 1 and therefore : 

^k ^ ^k-2n ^-fe ^ ^2n-k ^ ^ 

This gives : 



DFT2„(f) = [ fe(u;2"-'=) Lo''^ 



Jfc=0...2n-1- 
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Table 3: Exact complexity for polynomial multiplication in Cliebyshev basis, with degree n — I 



Algorithm 


nb. of operations in M 


Direct method 


2.5^2 - 0.5n 


Lima et. al [IT 


3.5n2 +7i'°s3 - 3.5n+ 1 


DCT-based 


(12n + 3) log 2n - 14n + 15 


PM-Chebyshev (Schoolbook) 


4n2 - 1 


PM-Chebyshev (Karatsuba) 


16„iog3 _ iOn+ 1 


PM-Chebyshev (DFT-based) 


16nlog2n - 8n + 19 



Considering the DFT as an evaluation process, we have 

f{w^) = {ujdf h{w^''-^) for = . . . 2n - 1 

where ui^ = co'^ = e ^""^ . We can easily see that computing DFT2„(f) is equivalent to reverse the 
values of DFT2„(&) and multiply them by the adequate power of uJd- This process needs exactly a 
permutation plus 4n — 2 multiplications in R, which is much less than the O(rilogn) cost of the 
FFT. 

Remark 1. Instead of applying a permutation, one can use the hermitian symmetry property of real 
input DFT. In other words, it is equivalent to say that b{Lu'^"~^) is equal to the complex conjugate 
ofHu:"). 

This remark has no influences on the complexity analysis but for real implementation it replaces 
memory swaps by modifications of the sign in the complex numbers structure. If data does not fit 
in cache, this might reduce the number of memory access and cache misses, and therefore provide 
better performances. 

Using these considerations, one can modify Algorithm PM-Chebyshev in order to save almost 
the computation of 2 DFTs. Hence, we obtain an arithmetic cost in this case of: 

• An log 2n + 4n + 5 multiplications, 

• 12n log 2n — 12n + 14 additions. 

4.5 Comparisons With Previous Methods 

We now compare the theoretical complexity of our new method with existing algorithms presented 
in Section [H 

In Table [31 we report the exact number of operations in R for each methods. One can conclude 
from this table that the asymptotically fastest multiplication is the one using DCT [Ij. However, 
according to the constants and the non-leading terms in each cost function, the DCT-based method 
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Figure 1: Theoretical speedup of polynomial multiplication in Chebyshev basis with different cost 
models. 
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is not always the most efficient, especially when polynomial degrees tend to be very small. Further- 
more, we do not differentiate the cost of additions and multiplications which does not reflect the 
reality of computer architecture. 

In Figure [l] one can find the speedup of each methods compared to the Direct method. We 
provide different cost models to capture a little bit more the reality of nowadays computers where 
the delays of floating point addition and multiplication may differ by a factor of 4 at large. Note 
that both axis use a logarithmic scale. 

First, we can remark that changing cost model only affect the trade-off between methods for 
small polynomials (i.e. size less than 16). As expected for large degrees, the DCT-based method is 
always the fastest and our Algorithm PM-Chebyshev (DFT-based) is catching up with it since they 
mostly differ by a constant factor. However, when polynomial degrees tend to be small (less than 
10) the Direct method is becoming the most efficient even if it has a quadratic complexity. 

As already mentioned in |17| . the method of Lima et. al tends to become more efficient than 
the direct method for small polynomials when the cost model assumes that one floating point 
multiplication cost more than three floating point additions. However, practical constraint such 
as recursivity, data read/write or cache access have an impact on performance, as we will see, in 
Section [S] and need to be considered. 

5 Implementation and Experimentations 

In order to compare our theoretical conclusions with practical computations, we develop a software 
implementation of our Algorithm PM-Chebyshev and we report here its practical performances. As 
a matter of comparison, we provide implementations for previous know methods: namely the Direct 
method and the DCT-based method. For the Direct method, a naive implementation with double 
loop has been done, while for the DCT-one we re-use existing software to achieve best possible 
performances. 

5.1 A Generic Code 

We design a C-I--I- code to implement Algorithm PM-Chebyshev in a generic fashion. The idea 
is to take the polynomial multiplication in monomial basis as a template parameter in order to 
provide a generic function. We decided to manipulate polynomials as vectors to beneflt from the 
CH — h Standard Template Library |19j . and thus benefit from genericity on coefficients, allowing the 
use of either double or single precision floating point numbers. Polynomial coefficients are ordered 
in the vector by increasing degree. The code given in Figure [5] emphasis the simplicity of our 
implementation: 

The function mulM corresponds to the implementation of the multiplication in monomial basis 
while the function mulC corresponds to the one in Chebyshev basis. The vectors a and b represents 
the input polynomials and c is the output product. As expected, this code achieves a complete 
reduction to any implementation of polynomial multiplication in monomial basis, assuming the 
prototype of the function is compliant. In our benchmarks, we will use this code to reduce to a 
homemade code implementing the recursive Karatsuba's multiplication algorithm. 
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Figure 2: Generic C++ code achieving the reduction to monomial basis muhiphcation. 



template<c las s T , 


void mulM( vector <T>&, 






c o 1 1 st vector "^T!^^ ^ 






const vector <T>&)> 




void mulC ( 


vector <T>&: c , 






vector <CT^& s 




const 


vector<T>& b){ 




S12C_tj cl3j,clt),clc 






da— a . size ( ) ; db 


— b . size ( ) ; dc— c . size ( ) ; 




vector <T> r ( db ) 


,g(dc); 




for ( i =0: i<db : i 


++) 




r [ i] = b [db-1- 


i ] ; 




mulM ( c , a , b ) ; 






mulM ( g , a , r ) ; 






for (i=0:i<dc; + 


+ i) 




c [ i ] *=0.5; 






c[0]+=c2 [da-1]- 


a[0] *b [0] ; 




for ( i =1: i<da-l; i++) 




c il+= 0. 5* ( g [da-l+il + g [da-l-il-a [01 * b [ i 1 -a [ 


]*b[0]); 


} 







5.2 Optimized Code Using DCT and DFT 

Many groups and projects have been aheady involved in designing efficient implementations of 
discrete transforms such as DCT and DFT. We can cite for instance the Spiral project [20] and 
the FFTW library effort [H]. In order to benefit from the high efficiency of these works, we build 
our DCT/DFT based codes on top of the FFTW routines. For both DCT and DFT computations 
we use FFTW plans with FFTW_MEASURE planning option, which offer optimized code using 
runtime measurement of several transforms. 

As explained in the documentation of the FFTW library, the DCT-I transform using a pre/post 
processed real DFT suffers from numerical instability. Therefore, the DCT-I implementation in 
FFTW is using either a recursive decomposition in smaller optimized DCT-1 codelets or a real 
DFT of twice the size plus some scalings. For the latter case, this means that the complexity of 
the DCT-I code is not reflecting the one of [S] we used in our complexity analysis. Taking this 
into account, one should replace the 2n points DCT-I transforms of Section [3^ bv 4n points DFT 
transforms plus 2n multiplications by the real constant 2. This increases the complexity of the 
DCT-based method to : 

• 6nlog4n — 12n + 6 multiplications, 

• ISn log 4n — 30?! + 12 additions. 

In the following, we will denote this modification as the DCT-based (via FFT) method. In order 
to make a fair comparison with this DCT method which cares of stability, we shall mention that 
the Algorithm PM-chebyshev (DFT-based) as explained in section [4.41 mav suffer in practice from 
instability issues. Indeed, the trick to compute DFT2„(f) from DFT2„(&) needs multiplications by 
some powers of ujd and therefore introduces errors in the computed DFT. As we will see in the next 
section these errors will have an influence on the accuracy of the final product. An alternative to 
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Figure 3: Theoretical speedup of polynomial multiplication in Chebyshev basis. 




not introduce these numerical errors is therefore to really compute DFT2„ (r) by an FFT calculation, 
but at a price of more operations. This consideration will increase the theoretical complexity to: 



• 5n log 2n — 3n + 9 multiplications, 

• 15n log 2n — 17n + 18 additions. 

In the following, we will denote this modification as the PM-chebyshev (DFT-based accur). 

Considering practical stability issues and its impact on the complexity of the DCT-based and our 
DFT-based method, we can see in Figure|3]the effect on the theoretical speedups. In particular, our 
DFT-based reduction is becoming more efficient than the DCT-based when polynomial's degrees are 
getting larger. This is of course explained by the difference of the constant term in the complexity: 
20nlog2n for our method and 24nlog2n for the DCT-based (via FFT). 

5.3 Code Validation 

As a matter of reliability, we check the validity of all our implementations. First, we check their 
correctness by verifying the results of their implementations done in a symbolic way using Mapl€0 
software. 

Since we want to perform numerical computations, it is clear that the accuracy of the results 
may differ from one method to another. It is therefore crucial to investigate their stability to give 
good statement on the accuracy. It is not the intend of this work to give statements on the accuracy 

lwww.maplesoft.com 
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and this task would definitively require a dedicated work. However, in order to give a small insight 
we did some experiments to emphasis the relative error of every methods. Let us now give the 
definition of the relative error on polynomials as given in |14] . 



Definition 5.1. Let a{x), b{x) be polynomials given in Chebyshev basis with double precision floating 
point numbers coefficients. We define c{x) to be the approximation of the product a{x)b{x) using 
double precision computation (53 bits of mantissa) and c{x) to be the exact product computed over 
rational numbers. Using this notation, the relative error E{c{x)) is defined as 

\\c{x)\\2 

where ||...||2 represents the Euclidean norm of polynomials, i.e. ||a(a;)||2 ~ (X]fe=o'^fc)' where 
the Ok correspond to the coefficients of a{x). 



Following this definition, we have computed the relative error on polynomial products using 
polynomial inputs having random floating point entries. While the numerical results are computed 
in double precision floating point numbers, the exact product is computed using the arbitrary 
precision rational numbers of the GMF(1 library. The relative error is almost computed exactly since 
only the square root is using floating point approximations, the remaining parts being computed 
over the rationals. We propose in Figure |4] the measure of the relative error in our experiments. 
The ordinates axis gives the average relative error of 50 products with different random double 
precision floating point entries lying between —50 and 50. 

As explained in Section 15. 2[ we can see in this figure that Algorithm PM-chebyshev (DFT- 
based) is clearly suffering from instability issues. In these settings, the replacement of one DFT 
by few multiplications of complex number powers introduces too many errors, which causes a loss 
in the accuracy of the final result (e.g. up to three decimal digits can be erroneous). Hopefully, 
using Algorithm PM-chebyshev (DFT-based accur) completely avoids this issue and we get similar 
accuracy as for the DCT-based methods. 

If we change a little bit the settings of our experiment, taking only positive fioating point random 
entries (e.g. in [0,50]), the instability issue of Algorithm PM-chebyshev (DFT-based) seems to be 
less dramatic. This is illustrated in Figure [5] where we can see that relative error of Algorithm 
PM-chebyshev (DFT-based) only differ by at most one decimal digit. 

Undoubtedly, these experiments exhibit some stability issues in Algorithm PM-chebyshev (DFT- 
based) while it does not seem to be the case for the other methods based on discrete transforms. 
Algorithm PM-chebyshev (Karatsuba) also seems to have some numerical issues. This can be 
motivated by the nature of Karatsuba method which replaces one multiplication by few additions. 

From these experiments we can conclude few thoughts. Algorithm PM-chebyshev (DFT-based 
accur) seems to offer the same numerical behaviour as the DCT based method, and thus offer a 
concrete alternative in practice. If accuracy is not a problem. Algorithm PM-chebyshev (DFT- 
based) will provide an interesting option as it will increase efficiency, see section 15.41 Finally, a 
theoretical study of the numerical stability of all these methods need to be done to give precise 
statement on their reliability. 

•^http:/ /gmplib.org/ 
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Figure 4: Experimental measure of the relative error in double precision (Intel Xeon 2 GHz). Entries 
lying in [—50, 50] 
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Figure 5: Experimental measure of the relative error in double precision (Intel Xeon 2 GHz). Entries 
lying in [0, 50] 
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Figure 6: Practical performances of polynomial multiplication in Chebyshev basis against direct 
method - Intel Xeon 2GHz (global view). 




5.4 Benchmarks 

We now compare the practical efficiency of the different methods. We performed our benchmarks 
on an architecture which represents nowadays processors: an Intel Xeon processor 5130 running at 
2GHz with 2x4MB of L2 cache. We use the gcc compiler version 4.4.4 with 03 optimization. Even 
if the platform is multi-core, we did not use any parallel computations and the FFTW library has 
been built sequential. For each method, we measure the average running time of several polynomial 
multiplications. All the computations have been done with double precision floating point numbers 
and with the same data set. 

Remark 2. We only offer an average running time estimate of each algorithms since it is not 
realistic on nowadays processor to estimate precise running time of computation taking few mil- 
liseconds. 

We report in Figure[S]the relative performances to the Direct method implementation for polyno- 
mial sizes ranging from 2 to 8192. Both axis use logarithmic scale, and the ordinates axis represents 
the speedup against Direct method. All times used in this figure are given in the appendix. One 
can also find in Figure [7] of the appendix more detailed views of the Figure [6l 

As expected, one can see on these Figures that the Direct method reveals the most efhcient for 
very small polynomials (i.e. polynomial degrees less than 16). This is explained by the low number 
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of operations required by this method and its simphcity which makes possible several optimizations 
by the compiler (e.g. loop unrolling). When polynomial sizes are getting larger, the methods based 
on discrete transforms become the most efficients. In particular, we can see that DCT-based method 
is catching up with its version based on FFT, which clearly illustrates that DCT-I implementation 
of FFTW is using a double length FFT, as explained in Section 15.21 Therefore, as expected, our 
Algorithm PM-Chebyshev (DFT-based) is the most efficient with polynomial sizes greater than 32. 
In particular, our PM-Cliebyshev (DFT-based) implementation is gaining 20% to 30% of efficiency 
over the DCT-based implementation. Of course, if a more accurate result is needed one may prefer 
to use the PM-Chebyshev (DFT-based accur) version but at a price of less efficiency : only 15% of 
gain against DCT-based implementation for polynomials of size 8096. Surprisingly, for polynomials 
of size 32 and 64 this method reveals to be the most efficient in practice. It seems that the code of 
the DFT in FFTW library is more efficient for these sizes than our implementation of the algebraic 
trick described in section l44l 

Remark 3. One could have been interested to see the practical behavior of the method of Lima et. 
al II 7 1. However, our feelings on the efficiency of such a method lead us to be pessimistic. Even if 
this method decreases the number of multiplications, it increases the overall number of operations. 
Moreover, this method needs an important amount of extra memory (i.e. 0{n}°^'^^)) which defini- 
tively increases data access and then should considerably penalize performances. Furthermore, the 
method is quite complex, especially for the indices management in the separation procedure. Since 
no detailed algorithm is given in jl 71 it is not easy to make an implementation and then offer a fair 
comparison. 

Finally, from our benchmarks we observe that the performance of the Karatsuba multiplication 
does not compete with the Direct method for small polynomials (e.g. size less than 16). Adding the 
storage of intermediate value within Karatsuba procedure plus the extra quadratic operations needed 
by the method of Lima et. al \17l will probably make its implementation not competitive with other 
existing methods. 

6 A Note on Problems Equivalence 

Let us consider the problem of multiplying two polynomials given by their coefficients in a given 
basis of the M- vector space of R[a::]. We denote this problem in monomial basis as M,„on and the one 
in Chebyshev basis as Mche- Under this consideration, one can demonstrate the following theorem: 

Theorem 6.1. Problem Mmon and Mche are equivalent under a linear time transform, Mmon =L 
Mche and the constant of both transforms is equal to two. 

Proof. As we have shown in Section [3] the problem of multiplying polynomials in Chebyshev basis 
linearly reduces to the multiplication in monomial basis, and the constant in the reduction is two. 
Thus we have already demonstrate Mche <l Mmon- 

We can show that Mmon <l Mche by using (j4]). Indeed, we can see from (|4]) that the d + 1 
leading coefficients of the product in Chebyshev basis exactly match with the ones in monomial 
basis on the same input coefficients. It is easy to show that the remaining d coefficients can be read 
from the product in Chebyshev basis of the reversed inputs. 
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Let us denote Xc the multiplication in Chebyshev basis and x the one in monomial basis. 
Consider the two polynomials a, 6 G ]R[a;] given in monomial basis as 

d d 

a{x) = ^^akx'' and b{x) = hkX^ . 
k=o fe=o 

Consider the polynomials a{x) , b{x) , a{x) and j3{x) sharing the same coefficients as a(x) and b{x) 
but expressed in Chebyshev basis: 

d d 

"(^) = '^(^kTk{x) , b{x) = bkTk{x), 

k=0 k=0 
d d 

"(^) = ^f^d-kTkix) , = ^6d_feTfc(x). 

fe=0 fe=0 

The coefficients of the polynomial c{x) = a{x) x b{x) expressed in monomial basis can be read 
from the coefficients of the polynomials 

f{x) = a{x) Xc b{x) and g{x) = a{x) Xc(3{x) 

using the relation 

{Qd-i-k for fc = 0...d- 1, 
fk for k = d . . . 2d. 

This clearly demonstrates that Mmon <l Mche and thus complete the proof. □ 



7 Conclusion 

We described yet another method to reduce the multiplication of polynomials given in Chebyshev 
basis to the multiplication in the monomial basis. Our method decreases the constant of the problem 
reduction and therefore offer a better complexity than the ones using basis conversions. Moreover, 
since our method does not rely on basis conversions, it might offer more numerical stability as 
it could be when converting coefficients to other basis. As we already mention, the problem of 
numerical stability is of great interest and should be treated as a dedicated article. 

Our PM-Chebyshev algorithm offers an efficient alternative to any existing quasi-linear algo- 
rithms. In particular, it allows to use Fast Fourier Transform of half length of the one needed 
by the specialized DCT-based method, which is an alternative when DCT codes are not available 
or sufficiently efficient. In such a case, our method achieves the best performances among all the 
available method for large degree polynomials. 

Although our reduction scheme using Karatsuba's method is not as efficient as one could have 
expected for polynomial of medium size, further work to optimize its implementation should be 
investigated. 

Finally, our attention in this work has been focused only on polynomials of R[a;] but our approach 
is still valid for Chebyshev polynomials defined over other domains, as Dickson polynomials over 
finite fields for example. 
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Figure 7: Practical performances of polynomial multiplication in Chebyshev basis against direct 
method - Intel Xeon 2GHz (partial view). 
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Table 4: Times of polynomial multiplication in Chebyshev basis (given in /xs) on Intel Xeon 2GHz platform. 



n 


Direct 


DCT-based 


DCT-based (FFT) 


PM-Cheby (Kara) 


PM-Cheby (DFT) 


PM-Cheby (DFT accur) 


2 


0.18 


1.08 


0.38 


0.39 


0.57 


0.46 


4 


0.28 


1.15 


0.48 


0.58 


0.66 


0.54 


8 


0.57 


1.58 


0.74 


0.80 


0.93 


0.80 


16 


1.13 


2.43 


1.47 


1.56 


1.52 


1.38 


32 


3.73 


4.33 


2.65 


4.74 


2.75 


2.59 


64 


13.44 


7.56 


8.11 


14.93 


5.09 


4.94 


128 


50.06 


15.76 


14.04 


61.68 


12.84 


15.52 


256 


185.48 


32.29 


29.69 


171.78 


23.58 


24.70 


512 


716.51 


69.00 


62.13 


489.29 


52.46 


57.07 


1024 


2829.78 


146.94 


135.47 


1427.82 


104.94 


112.40 


2048 


11273.20 


304.55 


317.35 


4075.72 


234.41 


249.88 


4096 


47753.40 


642.17 


679.50 


12036.00 


520.56 


566.43 


8192 


194277.00 


1397.42 


1437.42 


35559.60 


1125.40 


1185.41 



PM-Cheby stands for PM-Chebyshev algorithm. 



